Elliphant: A Machine Learning Method for Identifying Subject Ellipsis and Impersonal Constructions in Spanish
Abstract
This thesis presents Elliphant, a machine learning system for classifying Spanish subject ellipsis as either referential or non-referential. Linguistically motivated features are incorporated into a system that performs a ternary classification: verbs with explicit subjects, verbs with omitted but referential subjects (zero pronouns), and verbs with no subject (impersonal constructions). To the best of our knowledge, this is the first attempt to automatically identify non-referential ellipsis in Spanish. To enable a memory-based strategy, the eszic Corpus was created and manually annotated. The corpus is composed of Spanish legal and health texts and contains more than 6,800 annotated instances. A set of 14 features was defined, and a separate training file was created containing the instances represented as vectors of feature values. The training data was used with the Weka package in a set of optimization experiments to determine the best machine learning algorithm, its parameter settings, the most effective combinations of features, the number of instances needed to train the classifier, and the best settings for classifying instances from different genres. A comparative evaluation against Connexor's Machinese Syntax parser shows that Elliphant performs better. The overall accuracy of the system is 86.9%. Because subjects are elided fairly frequently in Spanish, classifying elliptic subjects as referential or non-referential can improve the accuracy of Natural Language Processing tasks that require zero anaphora resolution, among them information extraction, machine translation, automatic summarization and text categorization.
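The ternary, memory-based classification described in the abstract can be illustrated with a small sketch. It is not the Elliphant implementation: the abstract does not list the 14 features or name the Weka algorithm finally chosen, so the three features below (person, has_se, preceded_by_noun_phrase), the toy training instances, and the use of a plain k-nearest-neighbour vote with an overlap distance are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical categorical features; the abstract does not list the actual 14.
FEATURES = ("person", "has_se", "preceded_by_noun_phrase")

# Toy training instances (invented for illustration), one feature vector per verb,
# labelled with one of the three classes named in the abstract.
TRAIN = [
    ({"person": "3sg", "has_se": False, "preceded_by_noun_phrase": True}, "explicit"),
    ({"person": "3pl", "has_se": False, "preceded_by_noun_phrase": True}, "explicit"),
    ({"person": "1sg", "has_se": False, "preceded_by_noun_phrase": False}, "zero"),
    ({"person": "3pl", "has_se": False, "preceded_by_noun_phrase": False}, "zero"),
    ({"person": "3sg", "has_se": True, "preceded_by_noun_phrase": False}, "impersonal"),
    ({"person": "3sg", "has_se": True, "preceded_by_noun_phrase": False}, "impersonal"),
]


def overlap_distance(a, b):
    """Count the features on which two instances disagree."""
    return sum(a[f] != b[f] for f in FEATURES)


def classify(instance, train=TRAIN, k=3):
    """Memory-based classification: majority vote among the k nearest stored instances."""
    neighbours = sorted(train, key=lambda ex: overlap_distance(instance, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(test, train=TRAIN, k=3):
    """Fraction of test instances whose predicted class matches the gold label."""
    correct = sum(classify(x, train, k) == y for x, y in test)
    return correct / len(test)
```

A memory-based learner of this kind stores the annotated instances and defers all generalization to classification time, which is why an annotated corpus is a prerequisite for the approach.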
Similar resources
A machine learning method for identifying impersonal constructions and zero pronouns in Spanish / Un método de aprendizaje automático para la identificación de construcciones impersonales y pronombres cero en español
In this paper, we present a machine learning system for classifying subject ellipsis in Spanish as either referential or non-referential. To the best of our knowledge, this is the first attempt to automatically identify non-referential ellipsis in Spanish. An evaluation of our system against 6,827 finite verbs shows an accuracy of 87%.
Elliphant: Improved Automatic Detection of Zero Subjects and Impersonal Constructions in Spanish
In pro-drop languages, the detection of explicit subjects, zero subjects and nonreferential impersonal constructions is crucial for anaphora and co-reference resolution. While the identification of explicit and zero subjects has attracted the attention of researchers in the past, the automatic identification of impersonal constructions in Spanish has not been addressed yet and this work is the ...
Methods and Tools of Computational Linguistics for the Classification of Natural Non-referential Ellipsis in Spanish (review)
Vera Danilova. Abstract: This article presents a brief survey of the few works dedicated to modern natural language processing (NLP) approaches to the analysis of impersonal sentences in Spanish. Such an analysis consists of classifying non-referential ellipsis, and its results can be used in machine translation systems. The NLP approaches for Spanish are mainly based on the work of...
A Portuguese-Spanish Corpus Annotated for Subject Realization and Referentiality
This paper presents a comparable corpus of Portuguese and Spanish consisting of legal and health texts. We describe the annotation of zero subjects, impersonal constructions and explicit subjects in the corpus. We annotated 12,492 examples using a scheme that distinguishes between different linguistic levels (phonology, syntax, semantics, etc.) and present a taxonomy of instances on which annota...
Error Analysis for the Improvement of Subject Ellipsis Detection
This paper presents an analysis of the errors of a machine learning method, which allows us to propose changes to improve it in future developments. The evaluated system detects Spanish subject ellipsis and yields an accuracy of 85.3%. We extract the erroneously classified instances of our training data (1,001) and classify the errors. We perform an analysis of these instances taking into account ...